AITopics

doi: 10.1145/3770855.3817766

2605.26919

Country:

North America > United States (1.00)
Europe (0.93)

Genre: Research Report > New Finding (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.34)

Apte, Anuj; Deshpande, Pranav; Kumar, Niraj; Chakrabarti, Shouvanik; Kim, Junhyung Lyle

Anytime Training with Schedule-Free Spectral Optimization

arXiv.org Machine Learning, May-25-2026

Standard neural network training relies on learning-rate schedules tied to a fixed horizon, leading to strong path dependence and costly re-tuning as data availability changes. Schedule-Free (SF) methods address this by removing explicit schedules, yet SF-AdamW, the current state-of-the-art anytime optimizer, consistently underperforms well-tuned AdamW baselines. We propose SF-NorMuon, a schedule-free spectral optimizer that closes this gap: with a single hyperparameter configuration, SF-NorMuon matches or exceeds tuned AdamW on 125M and 772M parameter language models across $1$--$8\times$ Chinchilla horizons. On the theoretical side, we prove a stationarity guarantee for schedule-free spectral dynamics and identify weight decay at the fast iterate as essential for long-horizon stability. SF-NorMuon enables practitioners to obtain high-quality checkpoints at any point during training without committing to a horizon in advance. By closing the performance gap with tuned baselines, SF-NorMuon makes horizon-free optimization more practical, taking a step towards truly open-ended, continual learning.

decay, machine learning, natural language, (19 more...)

2605.23061

Country: North America > United States (0.92)

Genre: Research Report (0.52)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Balasubramanian, Krishnakumar

Large-Step Training Dynamics of a Two-Factor Linear Transformer Model

arXiv.org Machine Learning, May-21-2026

Gradient-flow analyses show that simplified linear transformers can learn the in-context linear-regression algorithm, but they do not explain the finite-step behavior of gradient descent at large learning rates. Motivated by empirical work on high-learning-rate transformer instabilities and by the cubic-map phase diagram for quadratic regression, we study an exactly reducible one-prompt linear-transformer training problem. After normalization, the dynamics reduce to a two-factor product map with an effective step-size parameter $μ$. On the balanced slice, this map recovers the known scalar cubic transition from monotone convergence to catapult convergence, periodic and chaotic bounded nonconvergence, and divergence. We then analyze the full two-dimensional system and show that, for $0<μ<2$, it has an explicit invariant Chebyshev ellipse separating forward-invariant regions; this ellipse carries off-balanced chaotic dynamics but is transversely repelling, while balanced scalar attractors can be transversely attracting. These results show that large constant learning rates can change the training attractor of the learned transformer rather than merely accelerating convergence: beyond sharp stability thresholds, finite-step training may settle into cycles, bounded chaos, or divergence instead of a single in-context linear-regression solution. We also discuss the consequences for mini-batch gradient descent based training methods.

artificial intelligence, gradient descent, machine learning, (16 more...)

2605.21292

Country: North America > United States (0.45)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.69)

arXiv.org Machine Learning, May-20-2026

ScheduleFree+: Scaling Learning-Rate-Free & Schedule-Free Learning to Large Language Models

Defazio, Aaron

Schedule-Free Learning has shown promise as a practical anytime training method for machine learning, showing success across dozens of standard benchmark problems. However, strong performance for LLM training has only been demonstrated at small scales. We identify a number of fixes necessary to scale up Schedule-Free Learning to larger batch sizes and model sizes, and present a learning-rate-free and schedule-free method (ScheduleFree+) for training large language models which greatly outperforms Warmup-Stable-Decay (WSD) schedules. We also demonstrate that Schedule-Free Learning is most effective for long duration training, and at 1000 tokens per parameter, it outperforms SOTA schedules by 31%. Schedule-Free Learning provides a theoretical foundation for the use of model averaging and checkpoint merging during pretraining.
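As a rough illustration of the schedule-free idea this abstract builds on, here is a minimal sketch of the basic Schedule-Free SGD recursion from earlier Schedule-Free Learning work. It is not ScheduleFree+ itself, whose fixes the abstract does not spell out. The function name, the `lr` and `beta` values, and the toy quadratic objective are all assumptions chosen for illustration.

```python
def schedule_free_sgd(grad, x0, lr=1.0, beta=0.9, steps=200):
    """Basic Schedule-Free SGD on a scalar parameter (illustrative sketch).

    z: base SGD iterate, updated with a constant step size (no schedule).
    x: running uniform average of the z iterates; this is the point returned.
    y: interpolation between z and x, where the gradient is evaluated.
    """
    z = x = float(x0)
    for t in range(1, steps + 1):
        y = (1.0 - beta) * z + beta * x   # gradient evaluation point
        z = z - lr * grad(y)              # constant-step update of the base iterate
        c = 1.0 / (t + 1)                 # uniform averaging weight
        x = (1.0 - c) * x + c * z         # running average of z iterates
    return x


# Hypothetical toy objective f(w) = 0.5 * (w - 3)^2, whose gradient is w - 3.
x_final = schedule_free_sgd(lambda w: w - 3.0, x0=0.0)
print(x_final)
```

Because the averaged iterate `x` is usable at every step, training can be stopped at any point without a decay phase tuned to a fixed horizon, which is the "anytime" property the abstract refers to.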

large language model, machine learning, natural language, (19 more...)

2605.19095

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Neural Information Processing Systems, Apr-30-2026, 05:08:32 GMT

ec52572b9e16b91edff5dc70e2642240-Paper-Conference.pdf

artificial intelligence, machine learning, optimization problem, (16 more...)

Genre: Research Report > New Finding (1.00)

Industry: Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.46)

Neural Information Processing Systems, Apr-30-2026, 02:27:37 GMT

Supplementary Material

All code can be downloaded from https://github.com/Shanka123/OCRA.

Figure S1: Abstract Reasoning Tasks (ART). Same/different: Two objects are presented, and the task is to say whether they are the same or different. Relational match-to-sample: A source pair of objects is presented that either instantiates a 'same' or 'different' relation, and the task is to select the pair of target objects (out of two pairs) that instantiates the same relation. [...] The task is to select the missing object from a set of four choices. Identity rules: An abstract pattern is instantiated in the first row (ABA, ABB, or AAA), and the [...] instantiated in the second row.

[...] in a 2×2 array format, with the source pair presented in the top [...] Problems were presented in a 2×3 array format, with one of the answer choices inserted into the bottom right cell (separate images for each answer choice, see Figure S8).

artificial intelligence, epoch, machine learning, (18 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.70)

Neural Information Processing Systems, Apr-30-2026, 01:06:20 GMT

Operator Learning with Neural Fields: Tackling PDEs on General Geometries (Supplemental Material)

A.1 Initial Value Problem

We use the datasets from Pfaff et al. (2021), and take the first and last frames of each trajectory as the input and output data for the initial value problem.

Cylinder: The dataset includes computational fluid dynamics (CFD) simulations of the flow around a cylinder, governed by the incompressible Navier-Stokes equation. These simulations were generated using COMSOL software, employing an irregular 2D triangular mesh. The trajectory consists of 600 timestamps, with a time interval of $\Delta t = 0.01$ s between each timestamp.

Airfoil: The dataset contains CFD simulations of the flow around an airfoil, following the compressible Navier-Stokes equation. These simulations were conducted using SU2 software, using an irregular 2D triangular mesh. The trajectory encompasses 600 timestamps, with a time interval of $\Delta t = 0.008$ s between each timestamp.

A.2 Dynamics Modeling

2D-Navier-Stokes (Navier-Stokes): We consider the 2D Navier-Stokes equation as presented in Li et al. (2021); Yin et al. (2022).
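The initial value setup in A.1 (first frame as input, last frame as output) can be sketched as follows. This is only an illustration under assumed conventions: the function name `make_ivp_pairs` and the toy nested-list layout of trajectories are hypothetical, and the actual mesh data format of the datasets is not described in this excerpt.

```python
def make_ivp_pairs(trajectories):
    """Pair the first frame (input) with the last frame (target) of each trajectory.

    `trajectories` is assumed to be a sequence of trajectories, each of which is
    a sequence of frames ordered in time.
    """
    inputs = [traj[0] for traj in trajectories]
    targets = [traj[-1] for traj in trajectories]
    return inputs, targets


# Hypothetical toy data: 2 trajectories with 600 timestamps each; a "frame" is
# represented here by a (trajectory_id, timestamp) tuple instead of a mesh field.
trajectories = [[(i, t) for t in range(600)] for i in range(2)]
inputs, targets = make_ivp_pairs(trajectories)
print(inputs, targets)
```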

artificial intelligence, machine learning, trajectory, (17 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Neural Information Processing Systems, Apr-29-2026, 19:50:12 GMT

ce26d21662c979d515164b416d4571fe-Paper-Conference.pdf

artificial intelligence, machine learning, meta-adam, (18 more...)

Country: North America > United States (0.28)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)

Neural Information Processing Systems, Apr-29-2026, 17:51:37 GMT

Temperature Balancing, Layer-wise Weight Analysis, and Neural Network Training

Regularization in modern machine learning is crucial, and it can take various forms in algorithmic design: training set, model family, error function, regularization terms, and optimizations. In particular, the learning rate, which can be interpreted as a temperature-like parameter within the statistical mechanics of learning, plays a crucial role in neural network training. Indeed, many widely adopted training strategies basically just define the decay of the learning rate over time. This process can be interpreted as decreasing a temperature, using either a global learning rate (for the entire model) or a learning rate that varies for each parameter. This paper proposes TempBalance, a straightforward yet effective layer-wise learning rate method. TempBalance is based on Heavy-Tailed Self-Regularization (HT-SR) Theory, an approach which characterizes the implicit self-regularization of different layers in trained models. We demonstrate the efficacy of using HT-SR-motivated metrics to guide the scheduling and balancing of temperature across all network layers during model training, resulting in improved performance during testing.

artificial intelligence, machine learning, tempbalance, (17 more...)